ABSTRACT

We present a new workflow for differentially-private publi- 
cation of graph topologies. First, we produce differentially- 
private measurements of interesting graph statistics using 
our new version of the PINQ programming language,
Weighted PINQ, which is based on a generalization of 
differential privacy to weighted sets. Next, we show how 
to generate graphs that fit any set of measured graph statis- 
tics, even if they are inconsistent (due to noise), or if they 
are only indirectly related to actual statistics that we want 
our synthetic graph to preserve. We do this by combining 
the answers to Weighted PINQ queries with an incremental
evaluator (Markov Chain Monte Carlo (MCMC)) that allows
us to synthesize graphs where the statistic of interest
aligns with that of the protected graph. This paper presents 
our preliminary results; we show how to cast a few graph 
statistics (degree distribution, edge multiplicity, joint degree 
distribution) as queries in Weighted PINQ, and then present 
experimental results of synthetic graphs generated from the 
answers to these queries. 

1. INTRODUCTION 

Despite recent advances in the design of query languages
[8, 12] that support differential privacy [2], sev-
eral emerging areas remain underserved by these lan- 
guages. Perhaps the most notable is social graph anal- 
ysis, where edges in the graph reflect private informa- 
tion between the nodes. Informally, differential privacy 
guarantees that the presence or absence of individual 
records (edges) is hard to infer from the analysis output. 
Because most interesting social graph analyses reflect 
paths through the graph, providing differential privacy 
in this context is hindered by the difficulty of isolating 
the influence of a single edge, and likewise masking its 
presence or absence. 

1.1 Exploiting non-uniform noise... 

To illustrate this difficulty, consider the problem of 
producing a differentially private measure of a funda- 
mental graph statistic, the joint degree distribution (JDD). 
The JDD can be thought of as a two-dimensional array
where entry $(d_1, d_2)$ gives a count of the number of edges
incident on nodes with degree $d_1$ and degree $d_2$.

If the maximum node degree in the graph is $d_{max}$,
a naive application of sensitivity-based differential privacy
would require adding noise proportional to $4 d_{max} + 1$
to the count at each entry of the JDD; this
would protect privacy even in the worst case, where
$d_1 = d_2 = d_{max}$. While this is great for privacy, the
downside is that when $d_{max}$ is large, adding this much
noise to every entry ruins the accuracy of our results.

Is it really necessary to add so much noise to all 
entries of the JDD? Happily, the answer is no. The 
analysis of [13] shows that the noise required to protect
the privacy of the JDD can be non-uniform: for each
$(d_1, d_2)$ entry of the JDD, it suffices to add noise
proportional to $4 \max(d_1, d_2)$. Indeed, one consequence of
our work is that these sorts of non-uniformities exist in
many graph analysis problems, e.g., counting triangles,
motifs, etc. In each case, features on low-degree vertices
are measured very accurately, with less accurate
measurements on higher-degree vertices.



1.2 ...without custom technical analyses! 

We can significantly improve the accuracy of differentially
private graph measurements by exploiting opportunities
to apply noise non-uniformly [13]. However,
each new measurement algorithm requires a new non-trivial
privacy analysis that can be quite subtle and
error prone. Moreover, existing languages like PINQ 
are no help, as they are explicitly designed not to rely 
on custom analyses by untrustworthy users. 

In this work, we use a new declarative programming
language, Weighted PINQ, designed to allow the user
to exploit opportunities for non-uniformity while still 
automatically imposing differential privacy guarantees. 
The key idea behind the language is the following obser- 
vation: rather than add noise non-uniformly to various 
aggregates of integral records, we add noise uniformly 
to aggregates of records whose weights have been non- 
uniformly scaled down. The operators in the language 
are perfectly positioned to identify problem records and 
scale their weight (down), rather than poison the anal- 
ysis by increasing the noise for all records. 



1.3 A new workflow for graph release. 

We describe a three-phase workflow that can generate 
synthetic graphs that match any properties of the true 
(secret) graph that we can measure in Weighted PINQ: 

Phase 1. Measure the secret graph. First, we cast 
various graph properties as queries in Weighted PINQ, 
and produce differentially private (noisy) measurements 
of the secret graph. Some of the queries will have natural
interpretations, e.g., a degree complementary cumulative
distribution function (CCDF) query (Section 4.1).
Other queries will be only indirectly meaningful, in that
they constrain the set of graphs which could have led
to them but they will not explicitly reveal the quantity
of interest (Section 6.4).



At the end of Phase 1, we discard the secret graph 
and proceed using only the differentially private mea- 
surements taken from it. When these measurements are 
sufficient, we can report them and stop. However, we 
can go much further using probabilistic inference [14],
i.e., by fitting a random graph to our measurements. 
While our measurements are noisy, they constrain the 
set of plausible graphs that could lead to them. More- 
over, the properties of this set of plausible graphs may 
be very concentrated, even if we did not measure these 
properties directly. For example, the set of graphs fit- 
ting both our noisy JDD measurements and our (very 
accurate) degree CCDF measurements should have as- 
sortativity very close to the secret graph, even if they 
have little else in common. Thus, we proceed as follows: 

Phase 2. Create a "seed" synthetic graph. Our
measurements typically contain queries about degrees
that, cleaned up, are sufficient to seed a simple random 
graph generator. We do so, as a very primitive approx- 
imation to the sort of graph we would like to release. 

Phase 3. Correct the synthetic graph. Start- 
ing from the seed synthetic graph produced in the sec- 
ond phase, we search for synthetic graphs whose accu- 
rate (noise-free) answers to our Weighted PINQ queries 
are similar to the noisy measurements we took in the 
first phase. We perform this search with Markov-Chain 
Monte Carlo (MCMC), a traditional approach from ma- 
chine learning used to search for datasets matching prob- 
abilistic observations, in our case graphs matching noisy 
answers to Weighted PINQ queries.

Generally, MCMC involves proposing a new candi- 
date (synthetic graph) at each iteration, and re-evaluating 
the Weighted PINQ queries on this graph to see if the 
fit has improved. While this could be time consum- 
ing, each MCMC iteration is designed to introduce only 
small updates to the candidate graph (e.g., swapping
one edge for another), and our implementation can efficiently
process small changes to its input (Section 3.2).
Depending on the query's complexity, we can process 
up to tens of thousands of candidate graphs per second. 



1.4 Our results and roadmap. 

This technical report details our new workflow, and 
reports on our initial experiences using it. We focus on 
a small number of interesting graph statistics (namely,
the degree distribution, edge multiplicity, assortativity,
and JDD) and show how our workflow can generate
synthetic graphs that match these statistics. But these 
statistics are by no means the end of the story; we have 
developed additional queries measuring other statistics 
of interest, e.g., clustering coefficient, triangles, graph 
motifs, etc., and are evaluating their efficacy.

Roadmap. We start in Section 2 with an overview of
Weighted PINQ, then provide background on
MCMC and describe the details of our implementation
in Section 3. In Section 4 we present Weighted PINQ
queries for first-order graph statistics, and experimental
results of seed synthetic graphs that were produced using
only these statistics (i.e., in Phase 2). In Section 5
we show how MCMC can be used to "correct" edge multiplicities
in these synthetic graphs, and in Section 6 we
show how we used an indirect measurement of the JDD
to "correct" the assortativity of our synthetic graphs.
In the appendix we provide details on two additional
algorithms that we used to generate our seed synthetic
graphs, and reproduce the result of [13] that it suffices to
add non-uniform noise to the JDD.

2. WEIGHTED DP AND WEIGHTED PINQ

We model a dataset $A$ as a weighting of the records:
$A : D \to \mathbb{R}$, where $A(r)$ represents the weight of $r$ in
$A$. Traditionally this number would be a non-negative
integer, but we allow any real number. We provide the
following generalization of differential privacy:

Definition 2.1. A randomized computation $M$ provides
$\epsilon$-differential privacy if for any weighted datasets
$A$ and $B$, and any set of possible outputs $S \subseteq \mathrm{Range}(M)$,

$$\Pr[M(A) \in S] \le \Pr[M(B) \in S] \times \exp(\epsilon \times \|A - B\|),$$

where $\|A - B\| = \sum_x |A(x) - B(x)|$.

This definition is equivalent to the standard definition 
when the range of A and B are assumed integral, and so 
only introduces the additional constraint that on frac- 
tional datasets, M must respect even slighter changes. 
If A and B have only a sliver of a weight in difference, 
the output probabilities under M must be that much 
closer, and are not excused an entire factor of $\exp(\epsilon)$.
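The weighted distance and the resulting bound on output probabilities can be sketched in a few lines of Python. This is a minimal illustration, assuming datasets are represented as dictionaries from records to real weights; the function names and the sample datasets are hypothetical and not part of Weighted PINQ.

```python
import math

def weighted_distance(a, b):
    """L1 distance ||A - B|| between two weighted datasets (dict: record -> weight)."""
    keys = set(a) | set(b)
    return sum(abs(a.get(x, 0.0) - b.get(x, 0.0)) for x in keys)

def max_probability_ratio(a, b, epsilon):
    """Largest ratio Pr[M(A) in S] / Pr[M(B) in S] permitted by Definition 2.1."""
    return math.exp(epsilon * weighted_distance(a, b))

# Hypothetical datasets: integral neighbours differ by one whole record,
# fractional neighbours differ by only a sliver of weight.
integral_a = {"x": 1.0, "y": 2.0}
integral_b = {"x": 1.0, "y": 1.0}
fractional_a = {"x": 1.0, "y": 2.1}
fractional_b = {"x": 1.0, "y": 2.0}

print(max_probability_ratio(integral_a, integral_b, epsilon=0.5))      # exp(0.5)
print(max_probability_ratio(fractional_a, fractional_b, epsilon=0.5))  # about exp(0.05)
```

On integral datasets differing in one record the bound is the familiar factor $\exp(\epsilon)$; on the fractional pair the permitted ratio is correspondingly smaller.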

Node vs. edge privacy. Our secret datasets con- 
tain directed edges with weight 1.0, where differential 
privacy masks the presence or absence of each edge. 
The set of all outgoing edges from a vertex v can be 
protected by weighting each directed edge by $0.5/d_v$,
trading some accuracy for a stronger notion of privacy. 
Note that this is not yet as strong as "node privacy",
where the existence of each vertex is masked; Ashton
Kutcher's existence on Twitter would not be masked, 
though the shame of each of his followers would be. 

For our examples, we will write weighted datasets as 
sets of weighted records, where each is a pair of record 
and weight. We will reserve real numbers with decimal 
points for weights in our examples to avoid confusion. 
For example, 

$A = \{("x", 1.0), ("y", 2.1)\}$ and $B = \{("0", 1.4)\}$.

In this example we have $A("y") = 2.1$ and $B("1") = 0.0$.

PINQ. Privacy Integrated Queries (PINQ) [8] is a
declarative programming language over integral datasets
that guarantees differential privacy for every program
written in the language. PINQ includes aggregations
(e.g., NoisyCount, which counts the records in a dataset
and adds Laplace noise) and transformations (e.g., Select,
GroupBy, Join), which transform a dataset while
managing their impact on the final privacy guarantee.

Weighted PINQ deviates from PINQ as follows: 

• Weighted PINQ's NoisyCount now returns a dictionary
from records to noised weights, rather than
one noised count (Section 2.1).

• Transformations in PINQ that required either scaling
up the noise or the privacy parameter $\epsilon$ (e.g.,
SelectMany, GroupBy, Join) now scale down the
weights associated with records (Section 2.2).

• There is a new operator to manipulate weights,
Shave (Section 2.2.2).



We now present a detailed look at the operators avail- 
able in Weighted PINQ. 

2.1 Aggregations 

Traditional differentially private aggregations such as
counting can easily be adapted to weighted datasets;
rather than count occurrences of each record (plus noise),
we simply report their weight (plus noise). Thus, the
NoisyCount method in Weighted PINQ reports the weight
of each record with Laplace noise added:

$$\mathrm{NoisyCount}(A, \epsilon)(x) = A(x) + \mathrm{Laplace}(1/\epsilon)$$

Using arguments identical to those for the unweighted
case, this provides $\epsilon$-differential privacy.

Data-parallel aggregations. The original PINQ [8]
was designed to follow LINQ closely, so that aggregations
like NoisyCount returned differentially private aggregations
of the entire subject collection. However,
recall that when differentially private queries are applied
in a data-parallel manner (to disjoint subsets of
the data), the ultimate privacy guarantee depends
only on the worst of the guarantees of each analysis,
not their sum. To take advantage of the privacy savings
offered by data-parallel aggregations, the original
PINQ had to define the Partition operator. In contrast,
the NoisyCount operator in Weighted PINQ has
data-parallel aggregations built in, and is thus closer to
the 'histogram query' of [2]. To do this, NoisyCount
now takes in a key selector function, and returns a dic- 
tionary mapping key to count. Importantly, this dictio- 
nary does not expose the set of keys, and when asked 
for a key not observed in the data it returns only noise 
rather than an admission of the key's absence. 
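The behaviour described above can be sketched in Python as follows. This is an illustrative stand-in rather than the Weighted PINQ implementation: the class and function names are hypothetical, and memoizing the noise per key (so that repeated lookups cannot average it away) is an assumption of this sketch.

```python
import random

class NoisyCounts:
    """Dictionary-like view of noised weights that never reveals which keys exist."""

    def __init__(self, weights, epsilon, rng=None):
        self._weights = dict(weights)
        self._epsilon = epsilon
        self._rng = rng or random.Random()
        self._released = {}  # noise is drawn once per key and then reused

    def _laplace(self):
        # The difference of two Exp(rate=epsilon) draws is Laplace with scale 1/epsilon.
        return self._rng.expovariate(self._epsilon) - self._rng.expovariate(self._epsilon)

    def __getitem__(self, key):
        if key not in self._released:
            # Unobserved keys have true weight 0.0, so the caller sees only noise.
            self._released[key] = self._weights.get(key, 0.0) + self._laplace()
        return self._released[key]

def noisy_count(dataset, epsilon, key_selector=lambda record: record, rng=None):
    """Sum weights per key, then expose them only through Laplace(1/epsilon) noise."""
    totals = {}
    for record, weight in dataset.items():
        key = key_selector(record)
        totals[key] = totals.get(key, 0.0) + weight
    return NoisyCounts(totals, epsilon, rng)

counts = noisy_count({"a": 1.0, "b": 2.5}, epsilon=0.5, rng=random.Random(0))
print(counts["a"], counts["b"], counts["never-seen"])
```

Asking for `"never-seen"` returns a draw from the noise distribution alone, which is indistinguishable in form from the answer for a key with very small weight.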

Other aggregations. Both Sum and Average
are generalized with weighted sums, and the addition
of Laplace noise with parameter $1/\epsilon$. The exponential
mechanism [?] also generalizes, to scoring functions
that are 1-Lipschitz with respect to their dataset input.

2.2 Stable Transformations 

PINQ identified an important property of dataset-to- 
dataset transformations called stability: a bound on the 
number of records that might change in the output as 
a result of a change in the input. Thus, if we combine
a stable transformation $T$ with an $\epsilon$-differentially private
aggregation $M$, the composition $M(T(\cdot))$ is also
$\epsilon$-differentially private.

We generalize transformation stability to the weighted
case, where the transformation $T$ maps weighted sets
over $D$ to weighted sets over $R$:

Definition 2.2. A transformation $T : \mathbb{R}^D \to \mathbb{R}^R$ is
stable if for any two datasets $A$ and $B$

$$\|T(A) - T(B)\| \le \|A - B\|. \qquad (1)$$

Remark. PINQ actually defined c-stability, where the 
difference is allowed to grow by a factor of c. While 
the set of c-stable transformations T are exactly those 
for which T/c is stable, we note that "scaling down" the 
dataset by a factor of c is possible only if the dataset is 
weighted. In unweighted datasets, a c-stable transfor- 
mation requires c times as much noise in aggregates of 
its outputs as in aggregates of its inputs to give a corresponding
level of differential privacy. In weighted
datasets, stable transformations allow the amount of
noise to remain fixed at $\mathrm{Laplace}(1/\epsilon)$, and instead diminish
the signal in the dataset by a factor of $c$.

We now review the weighted forms of the transfor- 
mations appropriated by PINQ from LINQ (C#'s Lan- 
guage Integrated Queries). 

2.2.1 Select, Where, and SelectMany

The Select transformation in LINQ takes a function
$f$ and applies it to each input record on a record-by-record
basis. The Where operation in LINQ takes a
predicate $p$ and retains only those records satisfying the
predicate. Both Select and Where are special cases of
LINQ's SelectMany transformation, which takes a function
$f$ mapping each record to a set of output records,
and returns the concatenation of the sets output by all
the input records.

While Select and Where transformations are both 
stable, the stability of SelectMany depends on the max- 
imum number of output records that may result, and 
thus SelectMany does not generally have bounded sta- 
bility. The original PINQ dealt with this by enforcing 
an upper bound of c on the number of records emitted 
by SelectMany, ensuring c-stability regardless of the 
number of records actually emitted. The constant $c$ applied
to records uniformly, and those who may want to
emit fewer or more records are not accommodated. In
contrast, Weighted PINQ no longer needs to suppress
output records. Instead, it supplies a more powerful
SelectMany operator that normalizes the weight of the
set of output records $f(x)$ corresponding to an input
record $x$ by $\max(1, \|f(x)\|)$. ($\|f(x)\|$ is the $\ell_1$-norm of
the set $f(x)$.) Thus, we can write

$$\mathrm{SelectMany}(A, f) := \sum_{x \in D} A(x) \times \frac{f(x)}{\max(1, \|f(x)\|)}$$



This form of SelectMany is stable, as the denominator
ensures that $f(x)$ is normalized to be at most a unit
dataset, and provides generalized stable forms of Select
and Where for weighted datasets.
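A small Python sketch of this weighted SelectMany, assuming datasets are dictionaries from records to weights and that $f$ returns a dictionary of output records with weights. The function name and the sample edge data are hypothetical.

```python
def select_many(dataset, f):
    """Weighted SelectMany: each input record spreads A(x) / max(1, ||f(x)||) over f(x)."""
    out = {}
    for x, weight in dataset.items():
        produced = f(x)  # dict: output record -> weight
        norm = sum(abs(w) for w in produced.values())  # L1 norm of f(x)
        scale = weight / max(1.0, norm)
        for y, w in produced.items():
            out[y] = out.get(y, 0.0) + scale * w
    return out

# Hypothetical example: map each unit-weight edge to both of its endpoints.
edges = {("a", "b"): 1.0, ("a", "c"): 1.0}
endpoints = select_many(edges, lambda e: {e[0]: 1.0, e[1]: 1.0})
print(endpoints)  # each edge contributes 0.5 to each endpoint

# Where is the special case in which f emits the record itself or nothing.
kept = select_many(edges, lambda e: {e: 1.0} if e[1] == "b" else {})
print(kept)
```

Because each edge emits two records, its unit weight is split in half, so the total output weight never exceeds the total input weight.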

2.2.2 Shave 

We now introduce a new stable operator used to dis- 
assemble weighted records into multiple distinct records, 
based on their initial weight. 

The Shave operator applies to a weighted collection,
and takes in a function from records to a sequence of
real values $f(x) = \langle w_0, w_1, \ldots \rangle$. For each record $x$ in
the input $A$, Shave produces records $(x, 0), (x, 1), \ldots$
for as many terms as $\sum_i w_i \le A(x)$. The weight of
output record $(x, i)$ is therefore

$$\max\Big(0, \min\Big(w_i, A(x) - \sum_{j<i} w_j\Big)\Big).$$

Intuitively, Shave shaves $x$ until its associated weight
$A(x)$ is exhausted. The functional inverse of Shave is
Select, which can transform each $w_i$-weighted indexed
pair from $(x, i)$ to $x$, whose weight re-accumulates to
$\sum_i w_i = A(x)$. Both of these operators can reallocate
weights across records, while conserving the total weight
available in the dataset.
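The following Python sketch illustrates Shave under the same dictionary representation of weighted datasets. The function name is hypothetical, and stopping at the first non-positive $w_i$ is a guard added for this sketch so that the loop always terminates.

```python
from itertools import repeat

def shave(dataset, weights_for):
    """Split record x of weight A(x) into (x, 0), (x, 1), ... with weights w_0, w_1, ..."""
    out = {}
    for x, total in dataset.items():
        used = 0.0
        for i, w in enumerate(weights_for(x)):
            remaining = total - used
            if remaining <= 0 or w <= 0:  # weight exhausted (or guard against w <= 0)
                break
            out[(x, i)] = min(w, remaining)  # max(0, min(w_i, A(x) - sum_{j<i} w_j))
            used += w
    return out

# Shaving with constant weights 1.0, as in Shave(1.0).
shaved = shave({"x": 2.5}, lambda record: repeat(1.0))
print(shaved)  # {('x', 0): 1.0, ('x', 1): 1.0, ('x', 2): 0.5}
```

The output weights sum to the original weight 2.5, which is the conservation property noted above.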

2.2.3 Union, Intersect, Concat, and Except 

Union and Intersect have pleasant weighted interpretations
as coordinate-wise max and min, while Concat and Except
correspond to coordinate-wise addition and subtraction:

$$\mathrm{Union}(A, B)(x) = \max(A(x), B(x))$$
$$\mathrm{Intersect}(A, B)(x) = \min(A(x), B(x))$$
$$\mathrm{Concat}(A, B)(x) = A(x) + B(x)$$
$$\mathrm{Except}(A, B)(x) = A(x) - B(x)$$

Each of these four transformations is stable with respect
to each input, with no requirements of the other
input.
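As a quick illustration, the four coordinate-wise operators can be written in Python as below. The helper names and the sample datasets are hypothetical; dropping zero-weight records from the result is a representation choice of this sketch.

```python
def _combine(a, b, op):
    """Apply op coordinate-wise; records absent from a dataset have weight 0.0."""
    out = {x: op(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}
    return {x: w for x, w in out.items() if w != 0.0}

def union(a, b):
    return _combine(a, b, max)

def intersect(a, b):
    return _combine(a, b, min)

def concat(a, b):
    return _combine(a, b, lambda u, v: u + v)

def except_(a, b):
    return _combine(a, b, lambda u, v: u - v)

A = {"1": 0.75, "2": 2.0}
B = {"2": 0.5, "3": 1.0}
print(union(A, B), intersect(A, B), concat(A, B), except_(A, B))
```

Note that Except can produce negative weights, which the weighted data model permits.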

Accumulating privacy costs. One important
consequence is that for binary operators, information
is revealed about both inputs. If a protected dataset is 
used in both inputs, the degradation of the differential 
privacy guarantee accumulates. PINQ manages this ac- 
cumulated privacy cost by ensuring that all participat- 
ing datasets have sufficient remaining privacy budget 
before executing a query. 

2.2.4 Join 

The Join transformation in LINQ takes two datasets, 
two key selection functions, and a reduction function to 
apply to each pair of records (one from each input) with 
equal keys. Like SelectMany, Join may produce a large 
number of output records, and generally does not have 
bounded stability. 

As with SelectMany, we can produce a stable Join by
scaling down the weights of the many output records.
Suppose we Join two datasets $A$, $B$, and let $A_k$ and $B_k$
be the restrictions of $A$ and $B$, respectively, to those
records mapping to a key $k$ under their key functions.
For every $k$ and for every pair $(\alpha, \beta) \in A_k \times B_k$, the Join
operator emits the record $\mathrm{reducer}(\alpha, \beta)$ with weight

$$\frac{A(\alpha) \times B(\beta)}{\|A_k\| + \|B_k\|} \qquad (2)$$
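The weighting in (2) can be sketched in Python as follows, again assuming dictionary-backed weighted datasets. The function name, the sample edges, and the choice of reducer are hypothetical.

```python
from collections import defaultdict

def join(a, b, key_a, key_b, reducer):
    """Weighted Join: pair (alpha, beta) with equal keys gets A(alpha)*B(beta)/(||A_k||+||B_k||)."""
    parts_a, parts_b = defaultdict(dict), defaultdict(dict)
    for record, weight in a.items():
        parts_a[key_a(record)][record] = weight
    for record, weight in b.items():
        parts_b[key_b(record)][record] = weight

    out = {}
    for k in parts_a.keys() & parts_b.keys():
        norm = (sum(abs(w) for w in parts_a[k].values())
                + sum(abs(w) for w in parts_b[k].values()))
        if norm == 0:
            continue
        for alpha, wa in parts_a[k].items():
            for beta, wb in parts_b[k].items():
                rec = reducer(alpha, beta)
                out[rec] = out.get(rec, 0.0) + wa * wb / norm
    return out

# Hypothetical example: length-two paths, joining an edge's target to another edge's source.
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("b", "d"): 1.0}
paths = join(edges, edges,
             key_a=lambda e: e[1], key_b=lambda e: e[0],
             reducer=lambda e1, e2: (e1[0], e1[1], e2[1]))
print(paths)  # each path through "b" has weight 1 / (1 + 2)
```

Here the key "b" matches one record on the left and two on the right, so each emitted path is scaled by $1/3$; keys with many matching records yield lower-weight outputs, which is the source of the non-uniformity.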



Example. The Join transformation is the workhorse 
of non-uniformity in our graph analysis queries; for an 
example of how we use it, refer to Section 6.2.

Stability. The division by the sum of the weights 
ensures not only that the weight introduced by a new 
record is bounded, but that the changes in the weights of 
existing records are also bounded. With this weighting, 
we can show that Join is stable with respect to both of 
its inputs: 

Theorem 2.3. For any datasets $A$, $A'$, $B$,

$$\|\mathrm{Join}(A, B) - \mathrm{Join}(A', B)\| \le \|A - A'\|.$$

Proof. Writing Join in vector notation as

$$\mathrm{Join}(A, B) = \sum_k \frac{A_k \times B_k^T}{\|A_k\| + \|B_k\|},$$

we want to show that, for each key $k$,

$$\left\| \frac{A_k \times B_k^T}{\|A_k\| + \|B_k\|} - \frac{A'_k \times B_k^T}{\|A'_k\| + \|B_k\|} \right\| \le \|A_k - A'_k\|.$$

The proof is essentially by cross-multiplication of the
denominators and tasteful simplification. We apply the
equality

$$ax - by = (a - b)(x + y)/2 + (a + b)(x - y)/2$$

to the left hand side, taking $a$ and $b$ to be the reciprocal
denominators, and $x$ and $y$ to be $A_k \times B_k^T$ and
$A'_k \times B_k^T$, respectively. Using the triangle inequality,
submultiplicativity ($\|A \times B^T\| \le \|A\| \|B\|$), and a substantial
amount of simplification, this bound becomes

$$\|A_k - A'_k\| \times \frac{2 \times (\|A_k\| + \|A'_k\| + \|B_k\|) \times \|B_k\|}{2 \times (\|A_k\| + \|B_k\|) \times (\|A'_k\| + \|B_k\|)}.$$

The numerator is exactly $2 \|A_k\| \|A'_k\|$ less than the denominator,
making the fraction at most one. $\square$

Accumulating privacy costs. The following ex- 
plains why privacy cost accumulates when the same pro- 
tected dataset is used in both inputs of the Join: 

Corollary 2.4. For any datasets $A$, $A'$, $B$, $B'$,

$$\|\mathrm{Join}(A, B) - \mathrm{Join}(A', B')\| \le \|A - A'\| + \|B - B'\|$$

Comparison to original PINQ. PINQ has several 
versions of Join that enforce limits on the number of 
records matching each key, effectively deriving stability 
bounds by tampering with the result set. These versions 
have the same defect as SelectMany; result sets that are 
small suffer higher than necessary stability bounds, and 
result sets that are large are damaged to provide the 
privacy guarantee. 

2.2.5 GroupBy 

The GroupBy transformation in LINQ takes a key se- 
lector function, and transforms a dataset into a collec- 
tion of groups, one group for each observed key, con- 
taining those records mapping to the key. This transfor- 
mation is somewhat less clean in the weighted setting, 
owing to our apparent need to replace multiple records 
with disparate weights with a single record reflecting 
their group, whose weight is difficult to choose.

Let the key selector function partition $A$ into multiple
disjoint parts, so that $A = \sum_k A_k$. The result of
GroupBy on $A$ will be the sum of its applications to each
of the parts $A_k$:

$$\mathrm{GroupBy}(A, f) = \sum_k \mathrm{GroupBy}(A_k)$$

For each part $A_k$, let $x_0, x_1, \ldots, x_n$ be an ordering of the
non-zero elements of $A_k$, non-increasing in $A_k(x_i)$. For
each $i$, GroupBy emits the set of elements up to and including
$x_i$ in this order, with weight $(A_k(x_i) - A_k(x_{i+1}))/2$,
where the weight following the last element is taken to be zero.
Thus, we can write:

$$\mathrm{GroupBy}(A_k)(\{x_j : j \le i\}) = (A_k(x_i) - A_k(x_{i+1}))/2.$$



Example. As an example of GroupBy, consider
grouping

$A = \{("1", 0.75), ("2", 2.0), ("3", 1.0)\}$.

The three sets we produce are {2}, {2, 3}, {2, 3, 1}, with
weights

GroupBy({2}) = 0.5
GroupBy({2, 3}) = 0.125
GroupBy({2, 3, 1}) = 0.375
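The example can be reproduced with a short Python sketch. The function name is hypothetical, the single grouping key "all" is an assumption used to put all three records in one group, and restricting attention to positive weights is a simplification of this sketch.

```python
from collections import defaultdict

def group_by(dataset, key):
    """Weighted GroupBy: emit each prefix of the weight-sorted group with half the weight gap."""
    parts = defaultdict(list)
    for x, w in dataset.items():
        if w > 0:  # simplification: only positive weights are considered
            parts[key(x)].append((x, w))

    out = {}
    for k, members in parts.items():
        members.sort(key=lambda m: -m[1])  # non-increasing in weight
        for i, (x, w) in enumerate(members):
            next_w = members[i + 1][1] if i + 1 < len(members) else 0.0
            weight = (w - next_w) / 2
            if weight > 0:
                prefix = frozenset(m[0] for m in members[: i + 1])
                out[(k, prefix)] = weight
    return out

A = {"1": 0.75, "2": 2.0, "3": 1.0}
groups = group_by(A, key=lambda record: "all")
for (k, members), weight in groups.items():
    print(sorted(members), weight)
```

The emitted weights are 0.5, 0.125, and 0.375, matching the values listed above.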

Stability. The stability of GroupBy is assured by the
following theorem:

Theorem 2.5. For any $A$, $A'$ and key function $f$,

$$\|\mathrm{GroupBy}(A, f) - \mathrm{GroupBy}(A', f)\| \le \|A - A'\|$$

Proof. For each $x_i$, a change in weight by $\delta$ which
does not change the ordering, that is

$$A(x_{i+1}) \le A(x_i) + \delta \le A(x_{i-1}),$$

causes the weight of $\{x_j : j < i\}$ to decrease by $\delta/2$ and
the weight of $\{x_j : j \le i\}$ to increase by $\delta/2$, for a total
change of at most $\delta$.

Larger changes must be broken down into a sequence
of such $\delta$s, where $x_i$ is switched for its neighbor between
each (with zero change in the output), but the ultimate
cost is still only the sum of the $\delta$s, resulting in a stable
transformation. $\square$

Remark. Our approach for GroupBy can be seen as
applying the unweighted transformation on a set of unweighted
datasets, each of which is then given weight
$w_i - w_{i+1}$. In fact, this is a general approach capable
of converting any stable transformation on unweighted
datasets into a stable transformation on weighted datasets.
Both SelectMany and Join could have been extended
this way, resulting in different semantics.

3. MARKOV CHAIN MONTE CARLO (MCMC) 

Next, we show how to use Markov Chain Monte Carlo 
to fit a synthetic dataset to the noisy answers to our 
Weighted PINQ queries. 

Probabilistic inference. For each query $Q$, observation
$S$, and dataset $A$, there is a well-specified probability
$\Pr[Q(A) = S]$ that describes how likely each
dataset is to have produced an observation we have 
made. It also informs us about the relative likelihood 
that the protected input is in fact the indicated dataset. 
Following [14], these probabilities can lead us to a most
likely dataset, or allow us to define a posterior distribu- 
tion over datasets given some prior. While differential 
privacy ensures we are unlikely to produce a dataset 
that matches many of the records, we may expect the 
posterior distribution to focus attention on graphs that 
match the statistical properties of the input dataset. 



The probability that a dataset $A \in \mathbb{R}^D$ gives rise
to counts $m \in \mathbb{R}^d$ under $d$ count queries $q_i : \mathbb{R}^D \to \mathbb{R}$,
having added Laplace noise with parameter $\epsilon_i$ to
measurement $i$, is

$$\Pr[Q(A) = m] \propto \exp\Big(-\sum_i \epsilon_i |q_i(A) - m_i|\Big)$$

The constant of proportionality is the same for all datasets.
We are therefore most interested in datasets $A$ minimizing
the cost function $\sum_i \epsilon_i |q_i(A) - m_i|$. The number of
datasets is clearly too large to explicitly enumerate and
evaluate each, and a more savvy approach is called for.
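The cost function and the corresponding unnormalized likelihood are simple to compute once the exact query answers on a candidate dataset are known. The sketch below is illustrative only; the function names and the sample numbers are hypothetical.

```python
import math

def cost(answers, measurements, epsilons):
    """sum_i epsilon_i * |q_i(A) - m_i| for exact answers q_i(A) and noisy measurements m_i."""
    return sum(e * abs(q - m) for q, m, e in zip(answers, measurements, epsilons))

def relative_likelihood(answers, measurements, epsilons):
    """Unnormalized probability that the candidate produced the measurements."""
    return math.exp(-cost(answers, measurements, epsilons))

# Two hypothetical candidates compared against the same noisy measurements.
measurements = [2.5, 1.0]
epsilons = [0.1, 0.1]
close_candidate = [3.0, 1.0]
far_candidate = [9.0, 4.0]
print(cost(close_candidate, measurements, epsilons))  # about 0.05
print(cost(far_candidate, measurements, epsilons))    # about 0.95
```

Since the constant of proportionality is shared, comparing costs (or likelihood ratios) between candidates is all that a search procedure needs.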

3.1 Metropolis-Hastings 

The approach we take is the Metropolis-Hastings al- 
gorithm. Metropolis-Hastings is an MCMC algorithm 
that uses (a) a function scoring datasets and (b) a ran- 
dom walk over datasets, to randomly reject (random- 
walk) transitions that lead to datasets with worse scores. 
The result is a new random walk over datasets, where 
the probability of a dataset is proportional to its score. 
Eliding some assumptions, the pseudo-code is: 

var oldState = new State();

while (true)
{
    // propose a neighbouring dataset via the random walk
    var newState = RandomWalk(oldState);

    var oldScore = score(oldState);
    var newScore = score(newState);

    // accept with probability min(1, newScore / oldScore)
    if (random.NextDouble() < newScore / oldScore)
        oldState = newState;
}



Choosing the random walk. The choice of random
walk is an important part of getting Metropolis-Hastings
to work well. We will simply choose the walk
that randomly replaces a single element of the dataset
with a randomly chosen new element. (E.g., the subsequent
results (Sections 5 and 6) use an edge-swapping random
walk that, at each iteration, chooses a random pair
of edges $(n_1, n_2)$ and $(n'_1, n'_2)$ in the synthetic graph, and
replaces them with edges $(n_1, n'_2)$ and $(n'_1, n_2)$.) More
advanced walks exist, but the elided assumptions (about
reversibility, and easy computation of stationary probability)
emerge as important details.
Choosing the scoring function. The choice of
scoring function also depends on the intent. While posterior
probability is a good candidate, it requires a good
prior over the datasets; graphs are especially challenging,
because a uniform prior places almost all of its mass
on dense graphs, which is not our prior belief about social
graphs. Instead, we simply raise the cost function
$\sum_i \epsilon_i |q_i(A) - m_i|$ to a high power, essentially doing a
greedy search for the most likely graph matching the answers
to our Weighted PINQ queries. We are currently
investigating a more principled approach.
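The pieces above (edge-swapping walk, score, accept/reject step) can be combined into a compact Python sketch. This is a toy illustration under stated assumptions, not the paper's incremental evaluator: the score is taken to be $\exp(-\mathrm{power} \times \mathrm{cost})$, which is one way to sharpen the search toward low-cost graphs, and the duplicate-edge cost used in the example is a hypothetical stand-in for a real query cost.

```python
import math
import random

def edge_swap(edges, rng):
    """Pick two edges (n1, n2), (m1, m2) and replace them with (n1, m2), (m1, n2)."""
    proposal = list(edges)
    i, j = rng.sample(range(len(proposal)), 2)
    (n1, n2), (m1, m2) = proposal[i], proposal[j]
    proposal[i], proposal[j] = (n1, m2), (m1, n2)
    return proposal

def metropolis_hastings(initial, cost, steps, power=1.0, rng=None):
    """Random walk over edge lists, accepting with probability min(1, newScore / oldScore)."""
    rng = rng or random.Random()
    state, state_cost = list(initial), cost(initial)
    for _ in range(steps):
        candidate = edge_swap(state, rng)
        candidate_cost = cost(candidate)
        # Assumed score: exp(-power * cost); work with the log of the score ratio.
        log_ratio = power * (state_cost - candidate_cost)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state, state_cost = candidate, candidate_cost
    return state, state_cost

def duplicate_edges(edges):
    """Toy cost: number of repeated edges in the multigraph."""
    return len(edges) - len(set(edges))

seed_graph = [("a", "b"), ("a", "b"), ("c", "d"), ("c", "d")]
final_graph, final_cost = metropolis_hastings(
    seed_graph, duplicate_edges, steps=200, power=1e6, rng=random.Random(1))
print(final_graph, final_cost)
```

Edge swaps preserve every node's outdegree and indegree, so the walk explores only graphs consistent with the seed graph's degree sequences. With a very large `power` the walk never accepts a strictly worse candidate, which corresponds to the greedy search described above.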



3.2 Implementation 

The evaluation of the score function in each itera- 
tion of MCMC involves computing the correct answers 
to all of the involved queries on input newState. As 
the queries can be almost arbitrary LINQ statements, 
a single evaluation of the score function could involve a 
substantial amount of work. 

Exploiting stability. Instead, we observe that
an important property of PINQ transformations, sta- 
bility, also results in a compelling computational prop- 
erty. Recall that stability requires that slight changes 
in the inputs to a transformation result in only slight 
changes in the output. As each iteration of the Markov 
chain makes only slight changes to the input dataset 
(swapping records in or out), stability implies that each 
intermediate dataset in the query also experiences rela- 
tively few changes as a result of the change to the input, 
all the way through to the measurements. 

Incremental query evaluator. Following this in- 
tuition, we have implemented an incremental form of 
the PINQ query evaluator. Each operator is capable of 
responding to the addition or deletion of weight, producing
an addition or deletion of weight in its output. As
detailed in the appendix, most transformations require
holding some state in memory, proportional to the num- 
ber of distinct records in the dataset, and computational 
effort proportional to the number of produced updates. 
Depending on the query's complexity, this implemen- 
tation allows us to process up to tens of thousands of 
candidate graphs per second. As the queries become 
more complicated the process slows, and we require ei- 
ther more time or more resources (our implementation 
is data-parallel). 

4. CREATING THE SEED GRAPH 

We are now ready to begin our description of our pro- 
cess of synthesizing graphs. We start by showing how 
we create a "seed" synthetic graph, that will (later) be 
used as a starting point for MCMC. Our seed synthetic 
graph is generated from Weighted PINQ queries related
to the degree distribution. 

4.1 Degree CCDF 

We present a Weighted PINQ query for computing 
the complementary cumulative distribution function 
(CCDF) of node outdegree. Starting from edges, the 
secret dataset of unit-weight directed edges $(n_i, n_j)$, do:

var degCCDF = edges.Select(edge => edge.src)
                   .Shave(1.0)
                   .Select((index, srcname) => index);

var ccdfCounts = degCCDF.NoisyCount(epsilon);

The steps of the query are depicted in the figure above.
We start by transforming the dataset so that each record
is a node's name $n_j$, weighted by its outdegree $d_j$. Next, we
shave each $n_j$ record into $d_j$ unique, unit-weight pairs:
$(n_j, 0), (n_j, 1), \ldots, (n_j, d_j - 1)$. By keeping only the index
of the pair, we obtain records $i = 0, 1, 2, \ldots, d_{max} - 1$,
each weighted by the number of nodes in the graph with
degree greater than $i$. Finally, taking a noisy count of the
weight of each record gives the outdegree CCDF. (Unsurprisingly,
replacing the first line of our algorithm
with edges.Select(edge => edge.tgt) would result in the
indegree CCDF.) We note that neither Shave nor Select
scales down a record's weight; it follows that our
CCDF query provides $\epsilon$-differential privacy while preserving
all of the weight in our edges dataset.

4.2 Degree Sequence 

The degree CCDF is the functional inverse of the degree
sequence, i.e., the monotonically non-increasing sequence
of node degrees in the graph, $d_1, d_2, \ldots, d_n$
such that $d_i \ge d_{i+1}$. To get the degree sequence, we
need only transpose the x- and y-axes of a plot of the
degree CCDF. We can do this in Weighted PINQ without
scaling down the weight of any of our records:

var degSeq = degCCDF.Shave(1.0)
                    .Select((index, degree) => index);

var degSeqCounts = degSeq.NoisyCount(epsilon);

This query, illustrated in the figure above, is actually
a Weighted PINQ implementation of an $\epsilon$-differential
privacy algorithm proposed by Hay et al. [5].
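To make the two queries concrete, the following Python sketch evaluates them exactly (without the final noise) on a tiny graph. It is a stand-in for the Weighted PINQ operators, with hypothetical helper names and a hypothetical five-edge graph.

```python
def select(dataset, f):
    """Weighted Select: move each record's weight onto f(record)."""
    out = {}
    for x, w in dataset.items():
        out[f(x)] = out.get(f(x), 0.0) + w
    return out

def shave_unit(dataset):
    """Shave(1.0): record x of weight w becomes (x, 0), (x, 1), ... of weight at most 1.0."""
    out = {}
    for x, w in dataset.items():
        i = 0
        while w > 0:
            out[(x, i)] = min(1.0, w)
            w -= 1.0
            i += 1
    return out

def degree_ccdf(edges):
    """Record i is weighted by the number of nodes with outdegree greater than i."""
    by_source = select({e: 1.0 for e in edges}, lambda e: e[0])
    return select(shave_unit(by_source), lambda pair: pair[1])

def degree_sequence(ccdf):
    """Transpose the CCDF: record j is weighted by the (j+1)-th largest outdegree."""
    return select(shave_unit(ccdf), lambda pair: pair[1])

# Outdegrees: a = 2, b = 2, c = 1.
edges = [("a", "b"), ("a", "c"), ("b", "a"), ("b", "c"), ("c", "a")]
ccdf = degree_ccdf(edges)
sequence = degree_sequence(ccdf)
print(ccdf)      # {0: 3.0, 1: 2.0}
print(sequence)  # {0: 2.0, 1: 2.0, 2: 1.0}
```

Both results carry total weight 5.0, the number of edges, which reflects the fact that no weight is scaled away by either query. In the actual workflow each of these weights would be released only after adding Laplace noise.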

4.3 Relating the degree sequence & CCDF 

Hay et al. observed that a significant amount of noise 
can be "cleaned up" in the degree sequence by using iso- 
tonic regression (because the degree sequence is known 
to be non-increasing) [5, 6]. We observe that the same is
true for the CCDF, and moreover the degree sequence 
and CCDF give accurate information about different 
aspects of the graph: the former accurately reports the 
graph's highest degrees, whereas the latter (its trans- 
pose) better reports the numbers of low degree nodes. 
While PAVA [6], and independently [?], can regress ei-
ther the CCDF or the degree sequence to a consistent 
non-increasing sequence, we present a regression tech- 
nique based on shortest paths that finds a single se- 
quence optimizing the pair of measurements, producing 
a degree sequence that is accurate for both the high- 
and low-degree nodes. 





            Nodes    Edges     Assortativity   Max indegree   Max outdegree
AS Graph    14233    32600     -0.306           172            2389
Collab.      5242    28980     +0.659            81              81
WikiVote     7115    103689    -0.068          2381             893

Table 1: Original (secret) graph statistics

Our regression technique. Letting $N$ be the number of nodes in the graph, we can think of a non-increasing degree sequence as a path $P$ on the two-dimensional integer grid from position $(0, N)$ to position $(N, 0)$ that only steps down or to the right. Our goal is to find such a path $P$ that fits our noisy degree sequence and noisy CCDF as well as possible; more concretely, thinking of $P$ as a set of Cartesian points $(x,y)$, and given the noisy "horizontal" CCDF measurements $h$ and the noisy "vertical" degree sequence measurements $v$, we want to minimize



$$\sum_{(x,y) \in P} \big( |v[x] - y| + |h[y] - x| \big) \qquad (3)$$



To do this, we weight edges in the two-dimensional integer grid as

$$\mathrm{cost}\big((x,y) \to (x+1,y)\big) = |v[x] - y|$$
$$\mathrm{cost}\big((x,y+1) \to (x,y)\big) = |h[y] - x|$$

and compute the lowest-cost path from $(0, N)$ to $(N, 0)$. To see how this works, notice that the cost of taking a horizontal step from $(x,y)$ to $(x+1,y)$ is a function of the "vertical" degree sequence measurement $v$; we do this because taking a horizontal step essentially means that we are committing to the vertical value $y$. (And vice versa for the vertical step.) Thus, finding the lowest-cost path allows us to simultaneously fit the noisy CCDF and degree sequence while minimizing (3).
We note that this lowest-cost path computation is 
more expensive than the linear time PAVA computa- 
tion. However, our algorithm constructs edges only 
as needed, and the computation need only visit nodes 
in the low cost "trough" near the true measurements, 
which reduces the complexity of this computation. 
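The path computation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our implementation: it fills the whole (N+1) x (N+1) grid with a plain dynamic program (the grid is acyclic because steps only go right or down), instead of constructing edges lazily around the low-cost trough, so it uses quadratic time and memory. The function name `fit_degree_sequence` and the toy measurements are hypothetical.

```python
def fit_degree_sequence(v, h):
    """Fit one non-increasing degree sequence to noisy measurements v and h.

    v[x] is the noisy degree of the x-th highest-degree node and h[y] is the
    noisy number of nodes with degree greater than y; both have length N.
    Returns (degrees, total_cost) for the cheapest monotone grid path.
    """
    n = len(v)
    if len(h) != n:
        raise ValueError("v and h must have the same length")
    inf = float("inf")
    # best[y][x]: cheapest cost of reaching grid point (x, y) from (0, n).
    best = [[inf] * (n + 1) for _ in range(n + 1)]
    # by_right_step[y][x]: True if the cheapest path enters (x, y) horizontally.
    by_right_step = [[False] * (n + 1) for _ in range(n + 1)]
    best[n][0] = 0.0

    for y in range(n, -1, -1):          # rows from the top (y = n) down to 0
        for x in range(n + 1):          # columns from left to right
            if x > 0:                   # horizontal step (x-1, y) -> (x, y)
                c = best[y][x - 1] + abs(v[x - 1] - y)
                if c < best[y][x]:
                    best[y][x], by_right_step[y][x] = c, True
            if y < n:                   # vertical step (x, y+1) -> (x, y)
                c = best[y + 1][x] + abs(h[y] - x)
                if c < best[y][x]:
                    best[y][x], by_right_step[y][x] = c, False

    # Walk back from (n, 0); each horizontal step fixes one node's degree.
    degrees = [0] * n
    x, y = n, 0
    while (x, y) != (0, n):
        if by_right_step[y][x]:
            x -= 1
            degrees[x] = y
        else:
            y += 1
    return degrees, best[0][n]


# Hypothetical noisy measurements of the degree sequence [3, 2, 1, 1].
noisy_v = [3.4, 1.6, 1.2, 0.9]
noisy_h = [4.2, 1.8, 1.1, -0.3]
fitted, cost = fit_degree_sequence(noisy_v, noisy_h)
print(fitted, cost)
```

Because every horizontal step is taken at a height no greater than the previous one, the returned sequence is non-increasing by construction, with no separate isotonic regression pass.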

4.4 Results: Initial synthetic graphs. 

We used a version of the dK1 graph generator to generate a random graph that fits (1) our "cleaned-up" non-increasing outdegree sequence (which has privacy cost 2ε, since it is generated from two ε-dp measurements, namely the degree sequence and the CCDF), (2) the corresponding indegree sequence, and (3) an ε-dp measure of the number of nodes in the graph (see the query in Appendix B). Setting ε = 0.1, these synthetic graphs use a total privacy cost of 5ε = 0.5.

Figure 1: Degree Sequence, ARIN AS graph.

Figure 2: Degree Sequence, Collab. Graph.

We present results for three graphs: a graph of autonomous systems in the ARIN region [1] (available on our project website), the Arxiv GR-QC collaboration graph, and the wiki-vote graph from the Stanford repository [7]. Statistics about the original graphs are in Table 1. In Figures 1, 2, and 3 we plot the measured degree sequences, both before and after regression, and compare them to the degree sequence of the actual protected graph. Regression removes most of the noise in the measured degree sequence; the normalized root-mean-square error (RMSE) after regression is less than 1% for each graph.

5. MULTIEDGES AND SELF-LOOPS 

Figures 1 and 2 reveal that our seed graphs contain self-loops (i.e., edges from a node to itself) and a large number of multiedges (i.e., repeated edges). By writing a Weighted PINQ query to measure the number of multiedges and self-loops, we can cause MCMC to prefer graphs that respect their absence.

Figure 3: Degree Sequence, Wiki-Vote graph.

The following query first uses Shave to report the multiplicity of each edge, and then Select to classify each as either self-loop or not. The numbers of each type are then counted, with noise. Since we only use Select and Shave, this query is ε-differentially-private without scaling record weights:

var multi = edges.Shave(1.0)
                 .Select((i, e) => new Pair(i, e.src == e.tgt ? 1 : 0))
                 .NoisyCount(epsilon);



The query is illustrated in the figure above. We start by shaving each edge, so that for every edge $e$ of multiplicity $i$, we obtain the pairs $(e, 0), (e, 1), \ldots, (e, i-1)$. Next, we transform these pairs $(e, i)$ so that if edge $e$ is a self-loop (i.e., $e = (n_1, n_2)$ where $n_1 = n_2$), we replace $e$ with the boolean 1; otherwise, $e$ is not a self-loop, and we replace it with 0. Our dataset now consists of pairs:



$$(0,0), \ldots, (0,i), \ldots, (0, m'-1)$$
$$(1,0), \ldots, (1,j), \ldots, (1, m-1);$$



where $m$ is the maximum multiplicity of self-loop edges in the graph (resp., $m'$ is the maximum multiplicity of non-self-loops). Moreover, because indices start at 0, the weight of pair $(1, j)$ corresponds to the number of self-loops in the graph with multiplicity greater than $j$ (resp., $(0, i)$ has weight corresponding to the number of non-self-loops with multiplicity greater than $i$).
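The weights of these pairs can be reproduced on a small example. The Python sketch below emulates Shave and Select on a dictionary of weighted records; it is not Weighted PINQ itself, and the helper names `shave` and `select` as well as the toy edge multiset are assumptions made for illustration. Records are emitted in the (self-loop flag, index) order used in the text.

```python
from collections import defaultdict


def shave(dataset, slice_weight=1.0):
    """Split each weighted record into indexed slices of at most slice_weight."""
    out = {}
    for record, weight in dataset.items():
        index = 0
        while weight > 1e-12:
            out[(index, record)] = min(slice_weight, weight)
            weight -= slice_weight
            index += 1
    return out


def select(dataset, fn):
    """Apply fn to every record, accumulating the weights of equal results."""
    out = defaultdict(float)
    for record, weight in dataset.items():
        out[fn(record)] += weight
    return dict(out)


# Hypothetical edge multiset: (src, tgt) -> multiplicity.
edges = {(1, 2): 3.0, (2, 3): 1.0, (4, 4): 2.0, (5, 5): 1.0}

# p is (index, (src, tgt)); keep (self-loop flag, index).
multi = select(shave(edges), lambda p: (1 if p[1][0] == p[1][1] else 0, p[0]))
for pair in sorted(multi):
    print(pair, multi[pair])
```

Here the pair (0, 1) ends up with weight 1 because exactly one non-self-loop edge has multiplicity greater than 1, and no step reduces the total weight of the input.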

We fed our seed synthetic graph into MCMC, and let it correct the self-loops and multiplicity for us. MCMC was consistently able to remove all of the extra multiedges and self-loops in the seed graphs depicted in Figures 1 and 2. Our edge-swapping MCMC algorithm does not alter the in- and out-degree distributions, and thus can only improve the fit to the measurements by correcting self-loops and multiedges. Note that we added an ε = 0.1 edge-multiplicity query to those used to generate our seed graphs, increasing the privacy cost to 6ε.

6. JOINT DEGREE DISTRIBUTION 

Our next objective is to correct the assortativity of our synthetic graph. The assortativity can be computed directly from the joint degree distribution (JDD), and gives a measure of degree correlations. Assortativity is high (close to +1) if nodes tend to be connected to nodes of similar degree, low (close to -1) if the opposite is true, and ≈ 0 in a random graph. If $E$ is the set of edges $(n_0, n_1)$ in the graph, and $d_0, d_1$ are the degrees of nodes $n_0, n_1$ respectively, then assortativity is defined as the Pearson correlation coefficient of $d_0$ and $d_1$ over the edges in $E$.

Fortunately, we need not measure the JDD directly. Instead, we measure a property that is indirectly related to the JDD, thus forcing MCMC to fit the synthetic graph to the assortativity of the original graph.
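For reference, assortativity is commonly computed as the Pearson correlation of the degrees at the two ends of each edge. The Python sketch below does this for a directed edge list; pairing the out-degree of the source with the in-degree of the target is an assumption made here for illustration (other pairings are possible), and the function name and toy graph are hypothetical.

```python
import math
from collections import Counter


def assortativity(edges):
    """Pearson correlation between source out-degree and target in-degree."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(tgt for _, tgt in edges)
    xs = [out_deg[src] for src, _ in edges]
    ys = [in_deg[tgt] for _, tgt in edges]
    m = len(edges)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0   # correlation is undefined; report 0 by convention here
    return cov / math.sqrt(var_x * var_y)


# Hypothetical star graph, stored as directed edges in both directions.
star = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 0)]
print(assortativity(star))
```

A star is the extreme disassortative case: every edge joins the high-degree hub to a degree-one leaf, so the correlation sits at the bottom of the range.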

Roadmap. As observed by [13] and discussed in Section 1.1 and Appendix C, obtaining an accurate measure of the JDD requires that we exploit non-uniform noise. Thus, the next set of Weighted PINQ queries we discuss make heavy use of the Join transformation, which scales down record weight in interesting ways (unlike the Shave and Select transformations that we've seen so far), and allows us to emulate non-uniform noise.

6.1 Using the GroupBy transformation 

Our JDD query uses the GroupBy transformation, which we describe in detail in Section 2.2.5. For our purposes, on integral datasets this transformation takes a function mapping records to key values and a function from a set of records to a result value, and for each observed key emits a pair with weight 0.5 containing the key and the function applied to the corresponding set of records.

We can obtain a dataset containing (node, indegree) 
pairs, each of weight 0.5, by taking 

var iDegs = edges.GroupBy(e => e.tgt, l => l.Count());

Importantly, these counts are taken without noise. The 
results of GroupBy are still protected data, and may 
only be examined through noisy aggregation. 

6.2 Using the Join transformation 

Next, we show how to use the Join transformation, allowing us to scale down record weights in interesting, non-uniform ways. As an example, consider applying the Join to our edges dataset and the iDegs dataset produced above:

var iDegEdges =
    edges.Join(iDegs, edge => edge.tgt, ideg => ideg.node,
               (edge, ideg) => new Pair(edge, ideg.deg));

Figure 4: JDD of Collab., 1.0-differential privacy

If the $d_i(v)$ in-neighbors of $v$ are $u_1, \ldots, u_{d_i(v)}$, then the restrictions of edges and iDegs to key $v$ are:

edges$_v$ = { ("$(u_1, v)$", 1), ..., ("$(u_{d_i(v)}, v)$", 1) }
iDegs$_v$ = { ("$(v, d_i(v))$", 0.5) }

The outputs for each key $v$ are therefore

"$((u_1, v), d_i(v))$", "$((u_2, v), d_i(v))$", ..., "$((u_{d_i(v)}, v), d_i(v))$".

We have $\|\text{edges}_v\| = d_i(v)$ and $\|\text{iDegs}_v\| = 0.5$, so the weight of every output element equals

$$0.5 / (d_i(v) + 0.5) = 1 / (2 d_i(v) + 1).$$

Note that the result uses edges twice (iDegs derives from edges), so using the dataset will cost twice the ε.
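This weight calculation can be replayed numerically. The Python sketch below emulates GroupBy and Join on dictionaries of weighted records; it is not Weighted PINQ code. It assumes the general Join rule that, for each key, a matching pair (a, b) receives weight A(a)·B(b) divided by the sum of the total weights of the two key groups, which agrees with the single-key calculation above. The helper names and the toy edge set are hypothetical.

```python
from collections import defaultdict


def group_by_count(dataset, key):
    """Emulate GroupBy(key, l => l.Count()) on unit-weight records.

    Each observed key yields one (key, count) record of weight 0.5.
    """
    counts = defaultdict(int)
    for record in dataset:
        counts[key(record)] += 1
    return {(k, c): 0.5 for k, c in counts.items()}


def join(left, right, left_key, right_key, result):
    """Emulate the weighted Join under the assumed per-key scaling rule."""
    left_groups, right_groups = defaultdict(list), defaultdict(list)
    for record, weight in left.items():
        left_groups[left_key(record)].append((record, weight))
    for record, weight in right.items():
        right_groups[right_key(record)].append((record, weight))
    out = defaultdict(float)
    for k in left_groups.keys() & right_groups.keys():
        norm = (sum(w for _, w in left_groups[k])
                + sum(w for _, w in right_groups[k]))
        for a, wa in left_groups[k]:
            for b, wb in right_groups[k]:
                out[result(a, b)] += wa * wb / norm
    return dict(out)


# Hypothetical unit-weight edges: node 3 has indegree 3, node 2 has indegree 1.
edges = {(1, 3): 1.0, (2, 3): 1.0, (4, 3): 1.0, (1, 2): 1.0}
i_degs = group_by_count(edges, key=lambda e: e[1])
i_deg_edges = join(edges, i_degs,
                   left_key=lambda e: e[1],
                   right_key=lambda d: d[0],
                   result=lambda e, d: (e, d[1]))
for record in sorted(i_deg_edges):
    print(record, i_deg_edges[record])
```

Edges into the indegree-3 node come out with weight 1/7 and the edge into the indegree-1 node with weight 1/3, matching 1/(2d + 1); edges into high-degree nodes are scaled down the most.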

6.3 A direct measure of the JDD. 

Given the (edge, indegree) dataset iDegEdges de- 
scribed above, we form the analogous (edge, outdegree) 
dataset oDegEdges, and join the two together to obtain 
the (outdegree, indegree) dataset required for the JDD: 

var jdd =
    oDegEdges.Join(iDegEdges, o => o.edge, i => i.edge,
                   (o, i) => new Pair(o.deg, i.deg));

var jddCount = jdd.NoisyCount(epsilon);

For each edge $(u,v)$, the weighted sets oDegEdges$_{(u,v)}$ and iDegEdges$_{(u,v)}$ both contain single elements, with weights $(2d_o(u) + 1)^{-1}$ and $(2d_i(v) + 1)^{-1}$, respectively. The resulting output record, $(d_o(u), d_i(v))$, has weight $(2d_o(u) + 2d_i(v) + 2)^{-1}$. Taking a NoisyCount of the jdd dataset provides a 4ε-differentially-private measure of the JDD (as each input uses edges twice). This corresponds to an ε-differentially-private measurement of the true count of $(d_1, d_2)$ edges with noise $8(d_1 + d_2 + 1)/\epsilon$, similar to (but worse than) the $4\max(d_1, d_2)/\epsilon$ from the custom analysis of [13].
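The arithmetic behind these two figures can be spelled out. The worked step below is a sketch under two assumptions: that Join gives an output record weight $ab/(a+b)$ when the two matching key groups each hold a single record of weights $a$ and $b$ (consistent with the calculation in Section 6.2), and that NoisyCount with parameter $\epsilon'$ adds Laplace noise of scale $1/\epsilon'$ to each weight.

```latex
\begin{align*}
a &= \frac{1}{2d_o(u)+1}, \qquad b = \frac{1}{2d_i(v)+1}, \\[4pt]
\frac{ab}{a+b} &= \frac{1}{1/a + 1/b}
               = \frac{1}{(2d_o(u)+1) + (2d_i(v)+1)}
               = \frac{1}{2d_o(u) + 2d_i(v) + 2}. \\[8pt]
\intertext{A query costing $4\epsilon'$ fits a total budget $\epsilon$ when
$\epsilon' = \epsilon/4$, so each weighted count $w \cdot c$, with
$w = (2d_1 + 2d_2 + 2)^{-1}$, receives noise of scale $4/\epsilon$.
Dividing by $w$ to recover the count $c$ gives}
\frac{4}{\epsilon} \cdot (2d_1 + 2d_2 + 2) &= \frac{8(d_1 + d_2 + 1)}{\epsilon}.
\end{align*}
```

Since $2(d_1 + d_2 + 1) > \max(d_1, d_2)$ for all non-negative degrees, this scale always exceeds the $4\max(d_1, d_2)/\epsilon$ of the custom analysis.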



6.4 An indirect measure of the JDD. 

Noise of magnitude $8(d_1 + d_2 + 1)/\epsilon$ in counts that are often zero and rarely much more can be a serious problem; even $4\max(d_1, d_2)$ was found to be too much to add to all entries in [13]. At this point, the combination of Weighted PINQ and MCMC shines. Instead of directly computing a NoisyCount on the jdd dataset, we can significantly improve our signal-to-noise ratio by first bucketing degrees based on information from our degree distribution measurements, and then taking a NoisyCount on the larger weight in each bucket:

var jddBucketCount = jdd.Select(x => bucket(x, buckets))
                        .NoisyCount(epsilon);

The bucket function takes the pair $(d_o(u), d_i(v))$ to a pair $(x,y)$ corresponding to the indices of the buckets the degrees land in. Weighted PINQ guarantees differential privacy (we don't need a new analysis) and MCMC focuses on graphs with the same contributions to each bucket (we don't need any new post-processing). All that we need to do is see how well it works.

Bucketizer. Because Weighted PINQ's Select allows for arbitrary transformation of records, we wrote a custom bucket function using ordinary LINQ code. We create $n \times n$ buckets by quantizing the set of indegrees present in our measured indegree CCDF (after regression) into $n$ equally-sized groups, and similarly for the outdegree. Then, bucket maps the pair $(d_o, d_i)$ to bucket $(x, y)$ if $d_o$ corresponds to the $x$-th smallest outdegree group and $d_i$ corresponds to the $y$-th smallest indegree group. Our buckets are created as follows:

createBuckets(oCCDF, iCCDF, n)
{
    // fractions of the node count at which to split the CCDFs
    for (int i = 0; i < n; i++)
        percentage[i] = 1.0 - (double)(i + 1) / n;

    // split oCCDF using the percentage values
    index = degree = 0;
    foreach (var nodes in oCCDF)
    {
        if (nodes <= oCCDF[0] * percentage[index])
        {
            oVal[index] = degree;
            index++;
        }
        degree++;
        if (index == n) break;
    }

    // ... same for iCCDF to get iVal

    // use oVal and iVal to create the n x n buckets
    pos = 0;
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            buckets[pos].a = oVal[i];
            buckets[pos].b = iVal[j];
            pos++;
        }
    }
    return buckets;
}
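The listing above builds the split values but does not show the bucket function itself. The following Python sketch is one plausible reading, offered as an assumption rather than as our code: `split_values` mirrors the CCDF-splitting loop, and `bucket` uses binary search to map a degree pair to the indices of the first split values that are at least as large, clamping degrees beyond the last split into the final bucket. Both names, the clamping rule, and the toy CCDF are hypothetical, and each split list is assumed to be non-empty.

```python
from bisect import bisect_left


def split_values(ccdf, n):
    """Return up to n degree thresholds that cut the CCDF into equal shares.

    ccdf[d] is the (cleaned-up) number of nodes with degree greater than d.
    """
    total = ccdf[0]
    splits = []
    for degree, nodes in enumerate(ccdf):
        if len(splits) == n:
            break
        if nodes <= total * (1 - (len(splits) + 1) / n):
            splits.append(degree)
    return splits


def bucket(pair, o_splits, i_splits):
    """Map (outdegree, indegree) to the pair of bucket indices it lands in."""
    d_out, d_in = pair
    x = min(bisect_left(o_splits, d_out), len(o_splits) - 1)
    y = min(bisect_left(i_splits, d_in), len(i_splits) - 1)
    return (x, y)


# Hypothetical cleaned-up CCDF, reused for both in- and outdegrees.
ccdf = [10, 6, 5, 2, 1, 0]
splits = split_values(ccdf, 2)
print(splits)
print(bucket((1, 3), splits, splits))
```

Because Select accepts arbitrary record transformations, swapping in a different bucketing rule changes only this function; the privacy analysis is unaffected.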



6.5 Results: Fitting the assortativity. 

The assortativity of our seed AS graph (Figure 2) is already quite close to that of the original AS graph (Table 1). Therefore, we only present results for the collaboration graph, generated using the six previously-discussed queries of privacy cost ε, as well as a JDD query with cost 4ε, for a total privacy cost of 10ε = 1.

Figure 4. We plot the "scaled-down JDD" where the count in each pair $(d_o, d_i)$ is scaled by $(2d_o + 2d_i + 2)^{-1}$. We compare the measured value (jddCount) to that of the original graph, and the synthetic graph after 2M iterations of MCMC. (For simplicity, we plot the degree product $d_o d_i$ on the x-axis instead of the two-dimensional degree pairs $(d_o, d_i)$.) While our measured JDD is extremely noisy for the high-degree pairs, it has much better accuracy for the low-degree pairs. If we use MCMC to combine the noisy JDD measurement with our highly-accurate degree distribution measurements (Figure 1), we clean up most of this noise; our scaled-down JDD has normalized RMSE of only 0.01 (averaging five trials) after MCMC.

Figure 5. Next, we consider the effect of bucketizing. We plot the assortativity of our synthetic graph versus the number of iterations of MCMC, for different choices of buckets. We plot the mean of five experiments and their relative standard deviation. Our synthetic graphs start out with assortativity ≈ 0, and climb towards the target assortativity of 0.65 as MCMC proceeds. Moreover, there seems to be a "happy medium" for bucketizing; with too few buckets, we lose too much information, and with too many buckets (or none at all), our signal-to-noise ratio is too low.

Table 2. We show the number of multiedges and self-loops for our seed synthetic graphs, and after 2M iterations of MCMC (averaging across the same five trials for each choice of buckets used in Figure 5). The true collaboration graph has no multiedges and 12 self-loops. Because MCMC is trying to fit edge multiplicity and JDD simultaneously, it does not completely eliminate spurious multiedges, but it does come close.

For this configuration, using $31^2 = 961$ buckets works best; averaging five trials we obtain assortativity 0.62, and normalized RMSE of 0.003 and 0.009 in the degree CCDF and "scaled-down JDD" respectively, with an average of 6 spurious multiedges and 10 self-loops (the real collaboration graph has 12 self-loops). Note that different graphs or choices of the privacy parameter ε may require different bucketizing strategies; we are characterizing this as part of our ongoing work.

7. RELATED WORK

Since the introduction of differential privacy [2], numerous bespoke analyses have emerged for specific problems, as well as general tools such as PINQ [8]. The bulk of this work has focused on statistical analyses of tabular data. Although graph-structured data can be viewed in a tabular form (each edge has a "source" and "destination" attribute), graph queries typically result in many Join operations over the tables, requiring excessive amounts of additive noise using standard tools [11]. Bespoke analyses have recently emerged for degree distributions [3], joint degree distribution (and assortativity) [13], triangle counting [9], and some generalizations of triangles [5]. We can provide analogues of each of these approaches in Weighted PINQ, typically matching proven bounds (within constants) and always exploiting the non-uniformities described in Section 1.1.

Many graph analyses satisfy privacy definitions other than differential privacy [4]. These definitions generally do not exhibit the robustness of differential privacy, and a comparison is beyond the scope of this note.

8. FUTURE WORK 

In this short paper, we provided an overview of our 
new workflow for differentially-private graph synthesis, 
and presented initial results of our experiments to gen- 
erate synthetic graphs that fit the degree distribution, 
edge multiplicity, and assortativity of the protected graph. 
But this is only the first step of our efforts; our cur- 
rent work involves developing improved queries of these 
graph statistics, and expanding to new ones, including 
triangles, clustering coefficient, motifs, and more. 



REFERENCES

[1] Ying-Ju Chi, Ricardo Oliveira, and Lixia Zhang. Cyclops: The Internet AS-level observatory. ACM SIGCOMM CCR, 2008.

[2] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, volume 3876 of Lecture Notes in Computer Science, pages 265-284. 2006.

[3] M. Hay, Chao Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In IEEE ICDM '09, pages 169-178, Dec. 2009.

[4] Michael Hay, Kun Liu, Gerome Miklau, Jian Pei, and Evimaria Terzi. Tutorial on privacy-aware data management in information networks. Proc. SIGMOD '11, 2011.

[5] Vishesh Karwa, Sofya Raskhodnikova, Adam Smith, and Grigory Yaroslavtsev. Proc. VLDB '11, pages 1146-1157, 2011.

[6] Jan De Leeuw, K. Hornik, and P. Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32(5), 2009.

[7] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Trans. KDD, 1(1), 2007.

[8] Frank D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proc. SIGMOD '09, pages 19-30, 2009.

[9] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In ACM STOC '07, pages 75-84, 2007.

[10] Davide Proserpio, Sharon Goldberg, and Frank McSherry. A workflow for differentially-private graph synthesis. Technical report.

[11] Vibhor Rastogi, Michael Hay, Gerome Miklau, and Dan Suciu. Relationship privacy: output perturbation for queries with joins. In Proc. PODS '09, 2009.

[12] Indrajit Roy, Srinath T. V. Setty, Ann Kilzer, Vitaly Shmatikov, and Emmett Witchel. Airavat: security and privacy for MapReduce. In Proc. USENIX NSDI '10, pages 20-20. USENIX Association, 2010.

[13] Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y. Zhao. Sharing graphs using differentially private graph models. In IMC, 2011.

[14] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy. Proc. NIPS, 2010.



APPENDIX 

A. INCREMENTAL QUERY EVALUATOR. 

The following provides the details of our implementation of the incremental form of the PINQ query evaluator for the various transformations in Weighted PINQ.

Select, Where, and SelectMany. Select, Where, and SelectMany are linear transformations:

SelectMany(A + B) = SelectMany(A) + SelectMany(B)

If additional records appear in their inputs, we need only process the records and emit their results, performing an amount of work proportional to the number of updates that must be produced.

Union, Intersect, Concat, and Except. Both Union and Intersect require some state to be able to track the maximum and minimum weight for each record between the two inputs. We simply maintain a table of weights for each input, updated when updates arrive from either input, and resulting in further updates if the maximum or minimum has changed. Both Concat and Except are stateless, in that each change of weight may simply be passed along to the consumers of the datasets (negated, for Except's second argument).

Join. The Join transformation involves a relatively complicated weighting of the output records. We keep the list of weighted records associated with each key, for both datasets. When an update arrives, we can determine from the previous cross product and sums, and the new cross product and sums, the incremental change in weight to output records. The amount of work performed is proportional to the number of updates produced. Key groups that did not receive updates produce no updates and incur no computation.

Shave. Shave maintains the largest index associated with each input record, as well as the weight associated with that index. To process a change in the weight of an input, Shave only needs to apply that change to the weight of the current index, reacting appropriately if the weight exceeds the maximum weight of the slice (causing the index to increase) or decreases below zero (causing the index to decrease).

GroupBy. For each key, GroupBy keeps the associated input records in non-decreasing order of $A(x)$. In response to a change in weight, GroupBy moves the changed record forwards or backwards in the list, emitting the necessary changes to the output, and doing only as much work as there are records to emit.

NoisyCount. Updating most aggregations is relatively easy, and this is especially true of NoisyCount. It is important to note that we are using the incremental evaluation for non-private inputs, and so it is entirely appropriate for NoisyCount to directly report the changed counts and how their new values affect the quality of approximation. Each NoisyCount can do this by maintaining the current count and the measured count, and each time the former changes, determining the change in the absolute difference between the two, scaled by ε.
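The bookkeeping described for NoisyCount can be sketched as a small Python class. This is an illustrative assumption about the state involved, not our evaluator: the class name, method name, and toy numbers are hypothetical. It stores the released noisy measurement and the current count of the candidate synthetic dataset for each record, and each update returns the change in ε·|current - measured|.

```python
class NoisyCountScore:
    """Track how well a candidate dataset's counts fit noisy measurements."""

    def __init__(self, measured, epsilon):
        self.measured = dict(measured)   # record -> released noisy count
        self.current = {}                # record -> count in the candidate
        self.epsilon = epsilon

    def update(self, record, delta_weight):
        """Apply a weight change and return the change in the scaled error."""
        target = self.measured.get(record, 0.0)
        old = self.current.get(record, 0.0)
        new = old + delta_weight
        self.current[record] = new
        return self.epsilon * (abs(new - target) - abs(old - target))


# Hypothetical measurement: record "a" was released with noisy count 3.2.
score = NoisyCountScore({"a": 3.2}, epsilon=0.1)
first = score.update("a", 1.0)    # candidate count moves from 0 to 1
second = score.update("a", 3.0)   # candidate count moves from 1 to 4
print(first, second)
```

An MCMC step can sum these returned changes over all affected records to decide whether a proposed edit to the synthetic graph improves the fit, without recomputing any count from scratch.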

B. EXTRA QUERIES AND ALGORITHMS 

The following two algorithms belong in our seed syn- 
thetic graph generation procedure. 

B.1 Number of nodes

Because we view our graph as a set of directed edges, we need one more differentially-private measurement before we can apply a simple graph generator that will fit a random graph to our cleaned-up in- and outdegree sequences:

var noisyNumNodes = edges.SelectMany(e => new[] { e.src, e.tgt })
                         .Shave(0.5)
                         .Select((index, node) => index)
                         .Where(i => i == 0)
                         .NoisyCount(epsilon);

var numNodes = (int)(noisyNumNodes * 2);

We start by using SelectMany to convert the dataset from a list of edges of unit weight to a list of nodes $n_i$, each with weight $\frac{1}{2}(\text{indegree}(n_i) + \text{outdegree}(n_i))$. (The factor of $\frac{1}{2}$ follows because SelectMany transforms each unit-weight edge record into two node records, and equally divides the weight between them.) Next, we shave each $n_j$ record into $d_j$ unique, $\frac{1}{2}$-weight pairs: namely $(n_j, 0), (n_j, 1), \ldots, (n_j, d_j - 1)$. We then use Where to restrict ourselves to the pairs with index 0; the total weight of these pairs is $\frac{1}{2}$ times the number of nodes in the graph. Finally, taking a noisy count of all these pairs, multiplying it by 2, and flooring the result using an int cast, we obtain an ε-differentially-private measure of the number of nodes.

B.2 dK1 graph generation algorithm

The following algorithm takes in a "cleaned-up" outdegree sequence oDegSeq and indegree sequence iDegSeq, and the measured value numNodes, and creates a synthetic random graph that fits these measurements. This algorithm accommodates situations where the number of nodes that have non-zero outdegree (resp., indegree) is not equal to numNodes.

The algorithm also uses the following values, which are easily extracted from oDegSeq and iDegSeq: (1) the number of edges in the graph numedges, (2) the number of nodes in the outdegree sequence onodes and in the indegree sequence inodes.

var oEdges = oDegSeq.SelectMany((deg, i) => Enumerable.Repeat(i, deg))
                    .Permute();

var names1 = Enumerable.Range(onodes, nodes - onodes)
                       .Permute();

var names2 = Enumerable.Range(0, inodes)
                       .Permute();

var iNames = names1.Concat(names2);

var iEdges = iDegSeq.Permute()
                    .SelectMany((deg, i) => Enumerable.Repeat(iNames[i], deg))
                    .Permute();

for (int j = 0; j < numedges; j++)
    yield new Edge(oEdges[j], iEdges[j]);

C. REPRODUCTION OF SALA ET AL.'S RESULT: NON-UNIFORM NOISE FOR THE JDD

Claim C.1. Let $D$ be the domain of possible node degrees, i.e., $D = \{0, 1, \ldots, d_{\max}\}$. The following mechanism is ε-differentially private: for each pair $(d_i, d_j) \in D \times D$, release a count of the number of edges incident on nodes of degree $d_i$ and degree $d_j$, perturbed by zero-mean Laplace noise with parameter $4\max\{d_i, d_j\}/\epsilon$.

Proof. We need to show that for two graphs $G_1$ and $G_2$ differing in a single edge $(a, b)$, the mechanism releasing all pairs $(d_i, d_j)$ with Laplace noise with parameter $4\max\{d_i, d_j\}/\epsilon$ satisfies ε-differential privacy.

Notation. This algorithm works on undirected edges, so that the pair $(d_i, d_j) = (d_j, d_i)$. Let $n(i,j) = 4\max\{d_i, d_j\}$. Let $t_1(i,j)$ be the true count of pair $(d_i, d_j)$ in graph $G_1$, and respectively $t_2(i,j)$ for $G_2$.

If we add Laplace noise with parameter $n(i,j)/\epsilon$ to each true count $t_1(i,j), t_2(i,j)$, the probability that we get a result where the $(d_i, d_j)$-th pair has value $r(i,j)$ is proportional to

$$\Pr[M(G_1) = r] \propto \prod_{i,j} \exp\big(-|r(i,j) - t_1(i,j)| \times \epsilon / n(i,j)\big)$$



Fixing the output $r$, we are interested in the ratio of this probability for $G_1$ and $G_2$. Fortunately, the constant of proportionality is the same for the two (because the Laplace distribution has the same integral no matter where it is centered), so the ratio of the two is just the ratio of the right-hand sides above:

$$\begin{aligned}
\frac{\Pr[M(G_1) = r]}{\Pr[M(G_2) = r]}
  &= \prod_{i,j} \exp\big((|r(i,j) - t_2(i,j)| - |r(i,j) - t_1(i,j)|) \times \epsilon / n(i,j)\big) \\
  &\le \prod_{i,j} \exp\big(|t_2(i,j) - t_1(i,j)| \times \epsilon / n(i,j)\big) \quad \text{(triangle inequality)} \\
  &= \exp\Big(\epsilon \times \sum_{i,j} |t_2(i,j) - t_1(i,j)| / n(i,j)\Big)
\end{aligned}$$

Thus, it suffices to show that

$$\sum_{i,j} |t_2(i,j) - t_1(i,j)| / n(i,j) \le 1 \qquad (4)$$



Notice that we need only worry about cases where $t_2(i,j) \ne t_1(i,j)$. Suppose that $G_1$ contains the edge $(a, b)$, while $G_2$ does not, and wlog assume that $d_a \ge d_b$, where $d_a$ and $d_b$ are the degrees of nodes $a$ and $b$ in graph $G_1$. The differences between $t_1$ and $t_2$ are as follows:

• The count of pair $t_1(d_a, d_b)$ is one higher than $t_2(d_a, d_b)$.

• Ignoring the $(a, b)$ edge, which we already accounted for above, node $a$ has $d_a - 1$ other edges incident on it; it follows that there are at most $d_a - 1$ pairs $(d_a, *)$ that are one greater in $G_1$ relative to $G_2$. Similarly, there are at most $d_b - 1$ pairs $(d_b, *)$ that are one greater in $G_1$ relative to $G_2$.

• Furthermore, since the degree of node $a$ in $G_2$ is $d_a - 1$, it follows that there are at most $d_a - 1$ pairs $(d_a - 1, *)$ that are one greater in $G_2$ relative to $G_1$. Similarly, there are at most $d_b - 1$ pairs $(d_b - 1, *)$ that are one greater in $G_2$ relative to $G_1$.

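Notice that the total number of differences, in terms of the degrees of $a, b$ in $G_1$ (before the edge is removed), is $2(d_a - 1) + 2(d_b - 1) + 1 = 2d_a + 2d_b - 3$. Note that if we express this in terms of the degrees of $a, b$ in $G_2$, we replace $d_a$ with $d_a + 1$ and similarly for $d_b$, so we get a total of $2d_a + 2d_b + 1$ as in [13]. It follows that the left side of (4) becomes:

$$\begin{aligned}
\sum_{i,j} \frac{|t_2(i,j) - t_1(i,j)|}{n(i,j)}
  &\le \frac{1}{n(d_a, d_b)} + \frac{d_a - 1}{n(*, d_a)} + \frac{d_a - 1}{n(*, d_a - 1)} + \frac{d_b - 1}{n(*, d_b)} + \frac{d_b - 1}{n(*, d_b - 1)} \\
  &\le \frac{1}{4\max\{d_a, d_b\}} + \frac{d_a - 1}{4d_a} + \frac{d_a - 1}{4d_a - 4} + \frac{d_b - 1}{4d_b} + \frac{d_b - 1}{4d_b - 4} && \text{(since } n(*, d_a) \ge 4d_a\text{)} \\
  &\le \frac{1}{4d_a} + 2 \cdot \frac{d_a - 1}{4d_a} + \frac{2}{4} && \text{(since } d_a \ge d_b\text{)} \\
  &= 1 - \frac{1}{4d_a} \\
  &\le 1
\end{aligned}$$

which is what we set out to show in (4). □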
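The final chain of inequalities can be spot-checked with exact rational arithmetic. The Python sketch below is a numerical sanity check, not part of the proof; the function name `lhs_bound` and the tested degree range are illustrative assumptions. It evaluates the second line of the chain, using the smallest allowed denominators $n(*, d) = 4d$, for every pair with $d_a \ge d_b \ge 2$ in the range (so that no denominator is zero).

```python
from fractions import Fraction


def lhs_bound(d_a, d_b):
    """Bound on the left side of (4) for a removed edge with d_a >= d_b >= 2."""
    return (Fraction(1, 4 * max(d_a, d_b))
            + Fraction(d_a - 1, 4 * d_a)          # pairs (d_a, *)
            + Fraction(d_a - 1, 4 * (d_a - 1))    # pairs (d_a - 1, *)
            + Fraction(d_b - 1, 4 * d_b)          # pairs (d_b, *)
            + Fraction(d_b - 1, 4 * (d_b - 1)))   # pairs (d_b - 1, *)


worst = max(lhs_bound(a, b) for a in range(2, 61) for b in range(2, a + 1))
print(worst, worst <= 1)
```

The bound is tightest when $d_a = d_b$, where it equals $1 - 1/(4d_a)$ exactly, so the largest value in the tested range is attained at the largest tested degree.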






